Abstract: In statistical exercises where there are several candidate models, 
the traditional approach is to select one model using some data driven crite- 
rion and use that model for estimation, testing and other purposes, ignoring 
the variability of the model selection process. We discuss some problems asso- 
ciated with this approach. An alternative scheme is to use a model-averaged 
estimator, that is, a weighted average of estimators obtained under different 
models, as an estimator of a parameter. We show that the risk associated with a 
Bayesian model-averaged estimator is bounded as a function of the sample size, 
when parameter values are fixed. We establish conditions which ensure that a 
model-averaged estimator's distribution can be consistently approximated us- 
ing the bootstrap. A new, data-adaptive, model averaging scheme is proposed 
that balances efficiency of estimation without compromising applicability of 
the bootstrap. This paper illustrates that certain desirable risk and resam- 
pling properties of model-averaged estimators are obtainable when parameters 
are fixed but unknown; this complements several studies on minimaxity and 
other properties of post-model-selected and model-averaged estimators, where 
parameters are allowed to vary. 
1. Introduction 



In typical statistical applications, it is rare that a precise model is available to fit 
to the data. Selecting one model from several competing models is often the first 
step in the process. However, in the subsequent analysis, it is common to ignore the 
variability in the initial model selection. Two of the many consequences of ignoring 
modeling variability are (i) under-estimation of the variability of estimators and 
predictors, and (ii) erroneous inference and prediction, resulting from incorrectly 
computing the distributions of estimators and predictors. An alternative to selecting 
a model first and then computing an estimator under that model is to consider 
several models and appropriately average the estimators computed under these 
models. 

Several studies have been published recently on the properties of post-model-selected 
and model-averaged estimators; see for example, [8], [23] and [24]. These 
studies are discouraging as they show that many nice properties associated with 
estimators under a known model vanish when there is model uncertainty. For exam- 
ple, Yang [23] shows that consistent model selection/averaging, and minimax-rate 
optimality cannot be simultaneously obtained. The review of Leeb and Pötscher [8] 
contains a discussion of several other problems with inference after model selection. 

In view of these negative results, it seems desirable to scale down our expectations 
while working under model uncertainty, and strive for positive, if weaker, results. 
This may be achieved in one of two ways: we may either impose less stringent 
conditions on our estimators, or we may relax the criterion by which an estimator 
is evaluated. The latter is the goal of the present study. 

The computation of an estimator is generally one of the early steps in a statistical 
exercise. Estimators of parameters are used for various purposes, notably 
for quantifying evidence for or against scientific hypotheses, obtaining interval es- 
timates for the parameter under consideration, for prediction and forecasting, and 
for quantifying the accuracy of predictions and forecasts. These applications require 
knowledge about the distribution of the estimator, and knowledge about the risk 
associated with the usage of such estimators. In this paper, we concentrate on the 
risk behavior of a model-averaged estimator, and on approximating the distribution 
of a model-averaged estimator using the bootstrap. 

In the first part of our study we show that under the traditional frequentist assumption 
that the parameters are fixed but unknown constants, the mean squared 
error in regression estimation under consistent model selection/averaging is bounded 
as a function of sample size. This complements Yang [23], where it was shown that 
a similar quantity cannot achieve minimax-rate optimality. Several of the negative 
results, including those of Yang [23], arise when a parameter is a known constant 
in a smaller model, while it is allowed to vary in a local neighborhood of that 
constant in a larger model. Recently, Hjort and Claeskens [5] studied model aver- 
aged estimators under a local parameter framework. Local parameters are ideal for 
mathematical development, but they are not reflective of statistical reality; see [17]. 
Indeed, as Hjort and Claeskens themselves remark in the rejoinder to the discussion 
of their paper, "a too literal belief in sample-size-dependent parameters would 
clash with Kolmogorov consistency and other requirements of natural statistical 
models." [5]. In view of this, it is meaningful to verify that estimators have rea- 
sonable risk behavior under consistent model selection/averaging when parameters 
are fixed constants. Our result also implies that integrated risks under consistent 
model selection/averaging are bounded, when integrals are taken with respect to 
any probability measure on the parameter space that does not depend on sample 
size. 

In the second part of our study, in addition to the assumption that the parame- 
ters are fixed but unknown constants, we also weaken the consistency requirement 
of the model averaging procedure. In the terminology of Yang [23], a model selec- 
tion/averaging scheme is consistent if it is asymptotically degenerate at the true 
model, when the true model is one of the candidate models. When the models are 
nested and several of them can correctly describe the data generation process, the 
most parsimonious correct model is taken as the true model. We call this strong 
consistency. We define a model selection/averaging scheme as weakly consistent if it 
selects or averages over all candidate models that correctly describe the data gen- 
eration process. When only one model is correct, the strong and weak consistency 
requirements are identical; but if models are nested and several of them are cor- 
rect, a weakly consistent scheme may distribute weights among all of them while a 
strongly consistent one is asymptotically degenerate at the smallest one. Recently, 
Leung and Barron [11] proposed a scheme of model averaging that results in nice 
risk behavior. Their scheme is an example of a weakly consistent procedure. We 
show that a particular choice of a weakly consistent model-averaged estimator has 
a distribution that can be approximated using the bootstrap. 

In Section 2 we propose a simple linear regression model framework to study 
model uncertainty. We also discuss some of the properties of post-model-selection 
estimators that make them unsuitable for further applications, and also some prop- 
erties of model-averaged estimators. This is followed in Section 3 with a discussion 
of mean squared error of the Bayesian model-averaged estimator. In Section 4 we 
propose a new adaptive, model-averaged estimator whose distribution may be con- 
sistently approximated using the bootstrap. A simulation example is discussed in 
Section 5. Finally, in Section 6 we discuss some aspects of our results, and point to 
some open issues relating to model uncertainty. 



2. Issues with model selection or averaging 

We select a simple regression framework for our study, which is the same as that 
used by [8], and similar to that of [24]. The observed data 
$\{(Y_t, x_t = (x_{t1}, x_{t2})^T),\ t = 1, \ldots, n\}$ are modeled as 

$$Y_t = \alpha x_{t1} + \beta x_{t2} + e_t, \tag{2.1}$$

where the $e_t$'s are independent, identically distributed $N(0, \sigma^2)$, with $\sigma^2$ known. The 
design matrix $X$, with rows given by $x_t^T = (x_{t1}, x_{t2})$, is non-random. We denote 
the two columns of $X$ as $X_1$ and $X_2$, the vector of errors as $e$, and the vector 
of observations as $Y$. The inner products and norms used below are the usual 
Euclidean ones. The notation $D$ is used for the determinant of $X^T X$, 
thus $D = \|X_1\|^2 \|X_2\|^2 - \langle X_1, X_2 \rangle^2$. The unknown parameters in this model 
are $(\alpha, \beta)$. Model uncertainty surrounds the issue of whether or not $\beta = 0$. In this 
paper, for ease in presentation, we consider the problem of estimation of $\alpha$. 

We make the standard assumption that $n^{-1} X^T X \to Q$ for a positive definite 
matrix $Q$. This, in particular, implies the standard design conditions 

$$\|X_1\|^2 = O(n), \qquad \|X_2\|^2 = O(n), \tag{2.2}$$

$$\langle X_1, X_2 \rangle = O(n), \qquad D = \|X_1\|^2 \|X_2\|^2 - \langle X_1, X_2 \rangle^2 = O(n^2). \tag{2.3}$$

We also assume that $n^{-1} \langle X_1, X_2 \rangle \not\to 0$ as $n \to \infty$, since without this restriction 
the effect of model uncertainty vanishes in this framework. 

The true model, called $M_0$, may be described as 

$$M_0 = \begin{cases} U \text{ (unrestricted)} & \text{if } \beta \neq 0; \\ R \text{ (restricted)} & \text{if } \beta = 0. \end{cases}$$
Under $U$, we adopt the ordinary least squares or maximum likelihood estimators 
$(\hat\alpha, \hat\beta)^T = (X^T X)^{-1} X^T Y$. Our notation for these is $(\hat\alpha(U), \hat\beta(U))$. 
Under $R$, $\hat\beta(R) = 0$, and the ordinary least squares or maximum likelihood estimator 
for $\alpha$ is $\hat\alpha(R) = [\sum_t x_{t1}^2]^{-1} \sum_t x_{t1} Y_t$. 
Define $V_1 = \sigma^{-1} \|X_1\|^{-1} \langle X_1, e \rangle$ and 
$V_2 = \sigma^{-1} D^{-1/2} \|X_1\| \{\langle X_2, e \rangle - \|X_1\|^{-2} \langle X_1, X_2 \rangle \langle X_1, e \rangle\}$, 
thus $V = (V_1, V_2)^T \sim N(0, I_2)$. In terms of $V$, the estimators are 

$$\begin{aligned}
\hat\alpha(R) &= \alpha + \beta \|X_1\|^{-2} \langle X_1, X_2 \rangle + \sigma \|X_1\|^{-1} V_1, \\
\hat\alpha(U) &= \alpha + \sigma \|X_1\|^{-1} V_1 - \sigma \|X_1\|^{-1} D^{-1/2} \langle X_1, X_2 \rangle V_2, \\
\hat\beta(U) &= \beta + \sigma \|X_1\| D^{-1/2} V_2.
\end{aligned}$$

The dichotomy between the bias of the restricted model $R$ and the variance of the 
unrestricted model $U$ can be clearly seen in the above formula. The restricted model 
estimator $\hat\alpha(R)$ has a bias factor $\beta \|X_1\|^{-2} \langle X_1, X_2 \rangle$, which vanishes under $R$, 
while $\hat\alpha(U)$ has an extra factor of $\sigma \|X_1\|^{-1} D^{-1/2} \langle X_1, X_2 \rangle V_2$ that inflates its 
variance relative to $\hat\alpha(R)$. Hence, model selection or model averaging is essentially 
a process of balancing bias and variance; see [20]. 
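
The representation of the three estimators in terms of $V = (V_1, V_2)^T$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample size, parameter values, design and all function and variable names are illustrative assumptions, not part of the paper.

```python
import math
import random


def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def fit_both_models(x1, x2, y):
    """Least squares fits: returns (alpha_R, alpha_U, beta_U)."""
    n11, n22, n12 = inner(x1, x1), inner(x2, x2), inner(x1, x2)
    d = n11 * n22 - n12 ** 2              # D = det(X^T X)
    s1, s2 = inner(x1, y), inner(x2, y)   # the two entries of X^T Y
    alpha_r = s1 / n11                    # restricted model (beta = 0)
    alpha_u = (n22 * s1 - n12 * s2) / d   # unrestricted model
    beta_u = (n11 * s2 - n12 * s1) / d
    return alpha_r, alpha_u, beta_u


# Illustrative data (all values below are assumptions of this sketch).
rng = random.Random(0)
n, alpha, beta, sigma = 50, 1.0, 0.5, 1.0
x1 = [1.0] * n
x2 = [rng.uniform(0.0, 3.0) for _ in range(n)]
e = [rng.gauss(0.0, sigma) for _ in range(n)]
y = [alpha * a + beta * b + err for a, b, err in zip(x1, x2, e)]

alpha_r, alpha_u, beta_u = fit_both_models(x1, x2, y)

# The same three estimators rebuilt from (V_1, V_2).
n11, n22, n12 = inner(x1, x1), inner(x2, x2), inner(x1, x2)
d = n11 * n22 - n12 ** 2
norm1 = math.sqrt(n11)
v1 = inner(x1, e) / (sigma * norm1)
v2 = norm1 * (inner(x2, e) - n12 * inner(x1, e) / n11) / (sigma * math.sqrt(d))
alpha_r_rep = alpha + beta * n12 / n11 + sigma * v1 / norm1
alpha_u_rep = alpha + sigma * v1 / norm1 - sigma * n12 * v2 / (norm1 * math.sqrt(d))
beta_u_rep = beta + sigma * norm1 * v2 / math.sqrt(d)
```

Both routes give the same numbers up to rounding, which also confirms the minus sign in front of the $V_2$ term of $\hat\alpha(U)$.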

Let $\sigma_\beta$ be the standard deviation of $\hat\beta(U)$. This is a non-random, known number 
depending on $\sigma^2$ and $X$. The following model selection criterion is used: 

$$\hat{M} = \begin{cases} U & \text{if } |n^{-1/2} \sigma_\beta^{-1} \hat\beta(U)| > c; \\ R & \text{if } |n^{-1/2} \sigma_\beta^{-1} \hat\beta(U)| \le c. \end{cases}$$

The above criterion may be identified as representative of standard model selection 
tools, in the simple regression model. In particular, the above criterion is the traditional 
pre-test procedure based on the likelihood ratio, coincides with the Akaike 
Information Criterion (AIC) if $c = \sqrt{2}$, and coincides with the Bayesian Information 
Criterion (BIC) if $c = \sqrt{\log n}$. The post-model-selection estimator of $\alpha$ 
is 

$$\tilde\alpha = \hat\alpha(R) I_{\{\hat{M} = R\}} + \hat\alpha(U) I_{\{\hat{M} = U\}}. \tag{2.4}$$
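
A pre-test selection step followed by the post-model-selection estimate can be sketched as below. This is an illustration under stated assumptions: the sketch standardizes $\hat\beta(U)$ by its exact standard deviation $\sigma \|X_1\| D^{-1/2}$ and compares the result with $c$; the hypothetical `scale` argument multiplies that statistic so that a different power of $n$ in the normalization can be supplied by the caller. The function name and the toy data are not from the paper.

```python
import math


def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def post_selection_estimate(x1, x2, y, sigma, c, scale=1.0):
    """Pre-test choice between R and U, then the matching estimate of alpha.

    `scale` is a hypothetical knob multiplying the standardized statistic.
    """
    n11, n22, n12 = inner(x1, x1), inner(x2, x2), inner(x1, x2)
    d = n11 * n22 - n12 ** 2
    s1, s2 = inner(x1, y), inner(x2, y)
    alpha_r = s1 / n11
    alpha_u = (n22 * s1 - n12 * s2) / d
    beta_u = (n11 * s2 - n12 * s1) / d
    sd_beta = sigma * math.sqrt(n11 / d)      # exact sd of beta_hat(U)
    statistic = abs(scale * beta_u / sd_beta)
    chosen = "U" if statistic > c else "R"
    return (alpha_u if chosen == "U" else alpha_r), chosen


# Toy, noise-free data (an assumption of this sketch).
x1 = [1.0] * 6
x2 = [0.0, 1.0, 2.0, 0.5, 1.5, 2.5]
y_flat = [1.0] * 6                        # alpha = 1, beta = 0
y_slope = [1.0 + 2.0 * b for b in x2]     # alpha = 1, beta = 2

c_aic = math.sqrt(2.0)
est_flat, model_flat = post_selection_estimate(x1, x2, y_flat, 1.0, c_aic)
est_slope, model_slope = post_selection_estimate(x1, x2, y_slope, 1.0, c_aic)
```

With the flat data the restricted model is chosen, and with the sloped data the unrestricted one; in both cases the returned estimate of $\alpha$ equals 1 because the toy data carry no noise.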

Several nice properties are known about $\hat{M}$ and, consequently, it is generally 
believed that $\tilde\alpha$ will also have good properties. Some of the important properties 
include that for all $\beta$, as $c \to \infty$ and $n^{-1/2} c \to 0$, we have $P[\hat{M} = M_0] \to 1$, 
$\{\hat{M} = M_0\} \subset \{\tilde\alpha = \hat\alpha(M_0)\}$ and thus $P[\tilde\alpha = \hat\alpha(M_0)] \to 1$ (see [15]). Note that $\hat\alpha(M_0)$ is the 
"oracle's guess" about $\alpha$, and is not a statistic, since it is based on the knowledge of 
$\beta$. The above properties tend to give the impression that $\tilde\alpha$ is a very good estimator. 

However, there are some major problems since the above results are asymptotic in 
nature, and the asymptotics can take a long time to kick in, as well as be dependent 
on the value of $\beta$. Our primary reference for this model and its basic properties, [8], 
identifies this as a problem of non-uniformity in $\beta$ of the convergence of $\hat{M}$ and $\tilde\alpha$. 
It can be immediately seen that the estimator $\tilde\alpha$ is super-efficient when $c \to \infty$, 
$c/\sqrt{n} \to 0$, as with BIC. The major repercussions of the super-efficiency of $\tilde\alpha$ and the 
non-uniformity of its asymptotics are in its risk performance, and in its finite sample 
behavior. The mean squared error of $\tilde\alpha$ is unbounded and depends on $\beta$, while that 
of $\hat\alpha(M_0)$ is a constant. As a consequence, the finite sample behavior of $\tilde\alpha$ is erratic 
and can be quite unlike its asymptotic approximation. Available simulations confirm 
this; see [8]. Several other studies conducted by Leeb, Pötscher, Yang and others 
reveal how and why the properties of $\tilde\alpha$ and $\hat\alpha(M_0)$ differ. For further information 
see, for example, [6, 7, 8, 9, 10, 22, 24, 25]. 
The super-efficiency of $\tilde\alpha$ results in most variations of the bootstrap being inapplicable. 
Only subsampling ([14]) and the m-out-of-n bootstrap with $m/n \to 0$ 
would yield consistent approximations of the distribution of $\tilde\alpha$. Unfortunately, these 
methods have problems of their own, some details of which can be found in [18] and 
[1]. Specifically, although subsampling is asymptotically consistent, it can perform 
miserably in finite samples. For any $a \in (0, 1)$, the actual asymptotic coverage of a 
standard level $(1 - a)$ subsampling confidence interval can be zero; see [1] for details. 
The finite sample properties of subsampling based methods can be improved 
sometimes by considering hybrid techniques, calibrations and other modifications, 
as documented by [2]. However, the asymptotic zero coverage of subsampling intervals 
for $\alpha$ cannot be reversed by, for example, size correction, since the technical 
conditions that allow such a correction to work are not satisfied by $\tilde\alpha$. 

The above issues with post-model-selection estimators lead to model-averaged 
estimators. A model-averaged estimator of $\alpha$ is of the form 

$$\check\alpha = \hat\alpha(R) p_R + \hat\alpha(U) p_U, \tag{2.5}$$

where $p_R$ and $p_U$ are two weights associated with the models $R$ and $U$. Yang 
and his co-authors have extensively studied aggregation across models for several 
statistical procedures like estimators and forecasts, in both their algorithmic as well 
as theoretical aspects (see [22, 23, 24, 25]). In particular, a result of [23] implies 
that when the model averaging technique is strongly consistent, the supremum of 
the mean squared error of $n^{1/2}(\check\alpha - \alpha)$ over values of $(\alpha, \beta)$ tends to infinity. Thus, 
strongly consistent model averaging does not attain the minimax rate. Our result 
in Section 3 shows that, up to constant terms, it is no worse than the post-model-selection 
estimator when $(\alpha, \beta)$ are held fixed. 

Recently, [5] studied several forms of model averaging and showed that a typical 
model-averaged estimator converges weakly to a mixture of normal laws, when 
the parameters of the true model are in an $O(n^{-1/2})$ neighborhood of the simplest 
candidate in a nesting of models. Since subsampling does not seem to perform 
well in practice, it is important to study conditions on model weights under which 
bootstrap approximations of finite sample distributions hold, i.e., conditions under 
which the statistic under consideration is smooth and asymptotically normal (see 
[12], [13]). This is studied in Section 4. 

3. Risk profile of model-averaged estimators 

Several problems associated with the post-model-selection estimator can be attributed 
to its lack of uniformity, as discussed extensively by others [8]. One is the 
super-efficiency of $\tilde\alpha$, for example, when BIC is used for model selection. The core 
problem of lack of uniformity in the convergence pattern of $\tilde\alpha$ is unavoidable, even 
with model averaging, when a strongly consistent model averaging technique is 
used, as described by [23]. In this section we show that when parameter values are 
fixed, model averaging is no worse than model selection, up to constant terms. 

Under the unrestricted model, $U$, we choose the prior on $(\alpha, \beta)$ to be a standard 
mean zero, identity covariance bivariate Normal distribution, $N(0, I)$. Under the 
restricted model, $R$, the prior on $\alpha$ is a standard univariate Normal distribution, 
$N(0, 1)$. We put equal prior weights, i.e., 1/2, on the models, so the prior odds is 
1. Our notation for the posterior probabilities of the two models is $\pi_{nU}$ and $\pi_{nR}$. 
Since $\sigma$ is known, without loss of generality we also assume $\sigma = 1$ in this section. 
Thus the Bayesian model-averaged estimator of $\alpha$ is 

$$\hat\alpha_{BMA} = \pi_{nU} \hat\alpha(U) + \pi_{nR} \hat\alpha(R). \tag{3.6}$$

We use the pre-selected, least squares estimators $\hat\alpha(U)$ and $\hat\alpha(R)$ as constituents 
of $\hat\alpha_{BMA}$, and consider the squared error loss function. The case where a general 
loss function is used, with $\hat\alpha(U)$ and $\hat\alpha(R)$ taken to be the Bayes estimators under 
models $U$ and $R$, is very similar. The following Proposition is our main result in 
this section. 
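
The posterior model probabilities in (3.6) have a closed form under the priors stated above. The sketch below assumes $\sigma = 1$, $N(0, I)$ priors and prior odds 1, so that marginally $Y \sim N(0, I_n + X X^T)$ under $U$ and $Y \sim N(0, I_n + X_1 X_1^T)$ under $R$; the matrix determinant lemma and the Woodbury identity then reduce everything to $2 \times 2$ algebra. The function names and toy data are illustrative assumptions.

```python
import math


def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def logistic(z):
    """Numerically stable exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bma_estimate(x1, x2, y):
    """Return (alpha_BMA, pi_nR) for sigma = 1, N(0, I) priors, prior odds 1."""
    n11, n22, n12 = inner(x1, x1), inner(x2, x2), inner(x1, x2)
    s1, s2 = inner(x1, y), inner(x2, y)
    d = n11 * n22 - n12 ** 2
    alpha_r = s1 / n11
    alpha_u = (n22 * s1 - n12 * s2) / d
    # det(I_n + X X^T) = det(I_2 + X^T X); same lemma for the restricted model.
    det_u = (1.0 + n11) * (1.0 + n22) - n12 ** 2
    det_r = 1.0 + n11
    # Y^T Sigma^{-1} Y = Y^T Y - (X^T Y)^T (I + X^T X)^{-1} (X^T Y).
    quad_u = ((1.0 + n22) * s1 ** 2 - 2.0 * n12 * s1 * s2
              + (1.0 + n11) * s2 ** 2) / det_u
    quad_r = s1 ** 2 / det_r
    # log m_U(Y) - log m_R(Y); terms common to both models cancel.
    log_ratio = (-0.5 * math.log(det_u) + 0.5 * math.log(det_r)
                 + 0.5 * (quad_u - quad_r))
    pi_r = logistic(-log_ratio)           # m_R / (m_R + m_U)
    return (1.0 - pi_r) * alpha_u + pi_r * alpha_r, pi_r


# Toy, noise-free data (an assumption of this sketch).
x1 = [1.0] * 6
x2 = [0.0, 1.0, 2.0, 0.5, 1.5, 2.5]
est_flat, pi_flat = bma_estimate(x1, x2, [1.0] * 6)
est_slope, pi_slope = bma_estimate(x1, x2, [1.0 + 2.0 * b for b in x2])
```

On the flat data the restricted model receives most of the posterior weight, while on the sloped data its weight is close to zero.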

Proposition 3.1. The normalized risk of $\hat\alpha_{BMA}$, $nR(\alpha) = nE(\hat\alpha_{BMA} - \alpha)^2$, 
satisfies $\sup_n nR(\alpha) < \infty$, for every fixed choice of $\alpha$ and $\beta$. Hence, the integrated 
normalized risk 

$$\sup_n \int nR(\alpha) \, d\lambda(\alpha, \beta) < \infty$$

for any probability measure $\lambda(\cdot)$ that does not depend on $n$. 

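Proof. In the following, we use $C$ as a generic constant, not depending on the 
parameters $\alpha$ and $\beta$ or the sample size $n$. 

Note that $\hat\alpha(R) = \hat\alpha(U) + \hat\beta(U) \|X_1\|^{-2} \langle X_1, X_2 \rangle$. Therefore, 

$$\begin{aligned}
nR(\alpha) &= nE\left[\pi_{nU} \hat\alpha(U) + \pi_{nR} \hat\alpha(R) - \alpha\right]^2 \\
&\le 2nE(\hat\alpha(U) - \alpha)^2 + 2n \|X_1\|^{-4} \langle X_1, X_2 \rangle^2 E\left\{\pi_{nR}^2 \hat\beta^2(U)\right\}.
\end{aligned} \tag{3.7}$$

Note that $E(\hat\alpha(U) - \alpha)^2 = \sigma^2 \|X_1\|^{-2} E\left[V_1 - \langle X_1, X_2 \rangle D^{-1/2} V_2\right]^2 = Cn^{-1}$ and 
$E \pi_{nR}^2 \hat\beta^2(U) \le 2\beta^2 E\pi_{nR}^2 + Cn^{-1}$. Thus, we need suitable bounds for $\beta^2 E\pi_{nR}^2$. 
Writing $m_R(Y)$ and $m_U(Y)$ for the marginal densities of $Y$ under models $R$ and $U$, we 
now have 

$$\pi_{nR} = \frac{m_R(Y)}{m_R(Y) + m_U(Y)} = \frac{m_R(Y)}{m_U(Y)} \left(1 + \frac{m_R(Y)}{m_U(Y)}\right)^{-1} \le \frac{m_R(Y)}{m_U(Y)}.$$

Then, making use of the moment generating function of a $\chi^2$ random variable, we 
can deduce that 

$$E\left(\frac{m_R(Y)}{m_U(Y)}\right)^2 = Cn^2 \exp\left\{-nC_0(\alpha^2 + \beta^2)\right\}$$

for a particular constant $C_0$. This yields, at (3.7), that 

$$nR(\alpha) \le C + Cn^3 \beta^2 \exp\left\{-nC_0(\alpha^2 + \beta^2)\right\},$$

which is bounded for every fixed $(\alpha, \beta)$, as a function of $n$. The rest of the result 
follows. □ 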
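
The inequality (3.7) and the bound on $E\pi_{nR}^2 \hat\beta^2(U)$ used in the proof can be unpacked as follows. This is a sketch of the omitted algebra; it uses only $\pi_{nU} + \pi_{nR} = 1$, $0 \le \pi_{nR} \le 1$, the identity $\hat\alpha(R) = \hat\alpha(U) + \hat\beta(U) \|X_1\|^{-2} \langle X_1, X_2 \rangle$, and the elementary inequality $(a + b)^2 \le 2a^2 + 2b^2$.

$$\hat\alpha_{BMA} - \alpha = (\hat\alpha(U) - \alpha) + \pi_{nR}\{\hat\alpha(R) - \hat\alpha(U)\} = (\hat\alpha(U) - \alpha) + \pi_{nR} \hat\beta(U) \|X_1\|^{-2} \langle X_1, X_2 \rangle,$$

$$n(\hat\alpha_{BMA} - \alpha)^2 \le 2n(\hat\alpha(U) - \alpha)^2 + 2n \|X_1\|^{-4} \langle X_1, X_2 \rangle^2 \pi_{nR}^2 \hat\beta^2(U),$$

$$\pi_{nR}^2 \hat\beta^2(U) \le 2\beta^2 \pi_{nR}^2 + 2(\hat\beta(U) - \beta)^2, \qquad E(\hat\beta(U) - \beta)^2 = \sigma^2 \|X_1\|^2 D^{-1} = O(n^{-1}).$$

Taking expectations in the second and third lines gives (3.7) and the stated bound, respectively; the order of $E(\hat\beta(U) - \beta)^2$ follows from the design conditions (2.2)-(2.3) together with the positive definiteness of $Q$.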

Remark 3.1. A lower bound for $nR(\alpha)$ can also be established using arguments 
similar to those above. With slight modification, the above approach using the 
moment generating function of a non-central $\chi^2$ random variable can be used to 
provide an alternative proof of Theorem 2 of [23]. It can also be seen that even when 
$(\alpha, \beta)$ vary over a compact set, the supremum of $nR(\alpha)$ over $(\alpha, \beta)$ is unbounded. 
4. Adaptive model-averaged estimators and the bootstrap 

The results of Hjort and Claeskens [5] and Leeb and Pötscher [8] indicate that 
the post-model-selection estimator and many model-averaged estimators cannot be 
consistently bootstrapped. The problems associated with the risk behavior, and 
those associated with bootstrap approximation, arise from two different sources. 
Undesirable behavior of the risk function arises from considering scenarios as parameters 
vary, while a major reason why the distribution of post-model-selection 
or model-averaged estimators cannot be approximated by bootstrap methods is 
the lack of smoothness of the estimator, or lack of asymptotic normality. 

In this section we study the conditions on the model weights which are required 
for consistent bootstrap approximation of the distribution of the resulting model-averaged 
estimator. Clearly, since the distribution of $\hat\alpha(U)$ can be approximated 
using the bootstrap, putting the entire weight on model $U$ is an option. However, 
balancing between $\hat\alpha(U)$ and $\hat\alpha(R)$ can lead to a more efficient estimator. We propose 
below a data-adaptive model weighting scheme that achieves the dual goals of 
reasonable efficiency and bootstrap consistency. 

A model-averaged estimator of $\alpha$ is of the form 

$$\check\alpha = \hat\alpha(R) p_{nR} + \hat\alpha(U) p_{nU}. \tag{4.8}$$

Notice that we have adopted a different notation ($p_{nR}$ and $p_{nU}$) for the model 
weights in this Section, from those ($\pi_{nR}$ and $\pi_{nU}$) used in Section 3. This is to 
emphasize that the nature of these weights may be different. We retain the condition 
that the parameters $(\alpha, \beta)$ are fixed but unknown. 

A primary requirement for consistency is $p_{nR} + p_{nU} = 1$, as pointed out in [5]. In 
order to avoid pathologies, we also specify that $p_{nU} \in [0, 1]$. Note that the weights 
$p_{nR}$ and $p_{nU}$ may depend on the parameters $(\alpha, \beta)$, and the random component $V$, 
apart from the known constants $X$ and $\sigma^2$. 

Replacing $p_{nU}$ by $1 - p_{nR}$, we thus have 

$$\check\alpha = \alpha + \sigma \|X_1\|^{-1} V_1 + \beta p_{nR} \|X_1\|^{-2} \langle X_1, X_2 \rangle - \sigma \|X_1\|^{-1} D^{-1/2} \langle X_1, X_2 \rangle (1 - p_{nR}) V_2.$$

A primary requirement on $\check\alpha$ is that it should be consistent, and the following 
proposition establishes a necessary and sufficient condition for this. 

Proposition 4.1. The model-averaged estimator $\check\alpha$ converges in probability to $\alpha$ if 
and only if $\beta p_{nR}$ converges in probability to zero as $n \to \infty$. 

Proof. The sufficiency part follows easily from the design conditions (2.2)-(2.3). 
For the necessity part, suppose that $\beta p_{nR} \xrightarrow{p} c \neq 0$ as $n \to \infty$. This is clearly 
equivalent to $p_{nR} \xrightarrow{p} c/\beta \neq 0$ as $n \to \infty$ and $\beta \neq 0$. Hence, we also have 
$(1 - p_{nR}) \left[\sigma \|X_1\|^{-1} D^{-1/2} \langle X_1, X_2 \rangle V_2\right] \xrightarrow{p} (1 - c/\beta) \cdot 0 = 0$. This implies 
$\check\alpha - c\gamma \xrightarrow{p} \alpha$, where $\|X_1\|^{-2} \langle X_1, X_2 \rangle \to \gamma$ as $n \to \infty$. The case where $p_{nR}$ does 
not have a limit can be treated similarly with a little more algebra. □ 

The next proposition is an extension of the previous one, and establishes sufficient 
conditions for asymptotic normality of $\check\alpha$. 

Proposition 4.2. The scaled and centered model-averaged estimator $n^{1/2}(\check\alpha - \alpha)$ 
has an asymptotic normal distribution if (i) $n^{1/2} \beta p_{nR}$ converges in probability to 
zero as $n \to \infty$, and (ii) $p_{nR}$ converges in probability as $n \to \infty$ for all values of 
$(\alpha, \beta)$. 

Proof. The first condition forces the bias component in $\check\alpha$ to be $o(n^{-1/2})$, while the 
second condition allows for use of Slutsky's theorem. □ 

By requiring $n^{1/2} \beta p_{nR} \xrightarrow{p} 0$ as $n \to \infty$ we have ensured that, when $\beta \neq 0$, we 
have $n^{1/2} p_{nR} \xrightarrow{p} 0$. Thus the model-averaged estimator is close to the unrestricted 
model estimator $\hat\alpha(U)$, and has the same limiting distribution up to first order 
terms. However, when $\beta = 0$, the asymptotic distribution of $n^{1/2}(\check\alpha - \alpha)$ depends 
on the limit of $p_{nR}$, which is between zero and one. Thus, when the restricted model 
holds, the asymptotic variance of $\check\alpha$ is between that of $\hat\alpha(R)$ and $\hat\alpha(U)$. The relative 
strengths of different candidates for model weight $p_{nR}$ may be evaluated by their 
probability limits when $\beta = 0$. We note that we consider $(\alpha, \beta)$ as fixed constants 
and do not allow them to vary with $n$. If, for example, we assumed $\beta = O(n^{-1/2})$, 
then the first condition of Proposition 4.2 would imply asymptotically zero weight 
on the restricted model. 

In order to progress towards bootstrap consistency, apart from asymptotic normality 
of $\check\alpha$, we also need $p_{nR}$ to be a smooth function, thus ruling out the indicator 
function $p_{nR} = I_{\{|n^{-1/2} \sigma_\beta^{-1} \hat\beta(U)| \le c\}}$ used in $\tilde\alpha$. Keeping in view the nice properties 
of $\tilde\alpha$, we now develop an adaptive, data-driven model weight function $p_{nR}$ that is a 
smooth version of $I_{\{|n^{-1/2} \sigma_\beta^{-1} \hat\beta(U)| \le c\}}$. 

For any $k_n$, we split the event $\{-k_n < \hat\beta(U) < k_n\}$ into two events, $\{\hat\beta(U) - k_n < 0\}$ 
and $\{\hat\beta(U) + k_n > 0\}$, and approximate the indicators of these events separately. 
Our approximation for $I_{\{\hat\beta(U) - k_n < 0\}}$ is 

$$\xi_{1n} = \left(1 + \exp\{-\gamma_{1n}(\hat\beta(U) - k_n)\}\right)^{-1} \exp\{-\gamma_{1n}(\hat\beta(U) - k_n)\},$$

and for $I_{\{\hat\beta(U) + k_n > 0\}}$ is 

$$\xi_{2n} = \left(1 + \exp\{\gamma_{2n}(\hat\beta(U) + k_n)\}\right)^{-1} \exp\{\gamma_{2n}(\hat\beta(U) + k_n)\}.$$

We take the two tuning values $\gamma_{1n}$ and $\gamma_{2n}$ to be always positive. However, they 
change with $n$; and in a major departure from traditional model weights, they are 
not equal to each other, and also depend on the data. Thus, $\gamma_{1n} = \gamma_{1n}(\alpha, \beta, V)$ and 
$\gamma_{2n} = \gamma_{2n}(\alpha, \beta, V)$ are unequal, random weights. 
Equipped with these functions, we define 

$$p_{nR} = 0.5 \xi_{1n} + 0.5 \xi_{2n}.$$

We adopt the paired bootstrap as our resampling strategy. Thus, we draw a simple 
random sample with replacement of the data pairs $(Y_i^*, x_i^*)$, $i = 1, \ldots, n$, from the 
original data $(Y_i, x_i)$, $i = 1, \ldots, n$. The entire process of obtaining $\hat\alpha(R)$, $\hat\alpha(U)$, 
$\hat\beta(U)$, $p_{nR}$, and $\check\alpha$ is imitated with the resample $(Y_i^*, x_i^*)$, $i = 1, \ldots, n$, and we 
approximate the distribution of $n^{1/2}(\check\alpha - \alpha)$ with the distribution of $n^{1/2}(\check\alpha^* - \check\alpha)$, 
conditional on $(Y_i, x_i)$, $i = 1, \ldots, n$. A technical condition guarantees that the 
design matrix from the resampled data is non-singular with high probability; see 
condition (1.17) of [3]. 
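
The paired bootstrap described above can be sketched generically: resample the pairs $(Y_i, x_i)$ with replacement, recompute the estimator on each resample, and collect the centered and scaled differences. In this illustration the estimator is passed in as a callable (a hypothetical interface), the example uses $\hat\alpha(U)$ as a stand-in for the full model-averaged estimator, and resamples with a singular design are simply skipped; none of these choices is prescribed by the paper.

```python
import math
import random


def ols_alpha_u(y, rows):
    """alpha_hat(U): first OLS coefficient for rows (x_t1, x_t2)."""
    n11 = sum(r[0] * r[0] for r in rows)
    n22 = sum(r[1] * r[1] for r in rows)
    n12 = sum(r[0] * r[1] for r in rows)
    s1 = sum(r[0] * v for r, v in zip(rows, y))
    s2 = sum(r[1] * v for r, v in zip(rows, y))
    return (n22 * s1 - n12 * s2) / (n11 * n22 - n12 ** 2)


def paired_bootstrap(y, rows, estimator, n_boot=500, seed=0):
    """Draws of sqrt(n) * (estimate* - estimate) from resampled (Y_i, x_i) pairs."""
    rng = random.Random(seed)
    n = len(y)
    estimate = estimator(y, rows)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # indices drawn with replacement
        try:
            starred = estimator([y[i] for i in idx], [rows[i] for i in idx])
        except ZeroDivisionError:
            continue                                 # singular resampled design
        draws.append(math.sqrt(n) * (starred - estimate))
    return draws


# Illustrative data (an assumption of this sketch).
rng = random.Random(1)
rows = [(1.0, rng.uniform(0.0, 3.0)) for _ in range(50)]
y = [1.0 * r[0] + 0.5 * r[1] + rng.gauss(0.0, 1.0) for r in rows]
draws = paired_bootstrap(y, rows, ols_alpha_u)
```

The empirical distribution of `draws` then serves as the bootstrap approximation; quantiles of it can be read off after sorting.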

The following Theorem is our main result in this section, and establishes consistency 
of the bootstrap for an adaptively weighted model-averaged estimator. 

Theorem 4.1. Assume that the sequence of constants $k_n \downarrow 0$ as $n \to \infty$. Suppose the 
tuning constants are chosen as $\gamma_{1n} = a_n \hat\beta(U)$, $\gamma_{2n} = -a_n \hat\beta(U)$, where $\{a_n\}$ is a 
sequence of positive constants satisfying $a_n^{-1} \log(n) \downarrow 0$ as $n \to \infty$. 

Then $n^{1/2}(\check\alpha - \alpha)$ has an asymptotic Normal distribution and the paired bootstrap 
is consistent for it. 
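
With the tuning constants of Theorem 4.1, the exponents in $\xi_{1n}$ and $\xi_{2n}$ become $-a_n \hat\beta(U)(\hat\beta(U) - k_n)$ and $-a_n \hat\beta(U)(\hat\beta(U) + k_n)$, so the weight on the restricted model can be computed as in the sketch below. The function names are illustrative, and the value of $k_n$ is a hypothetical choice made only so that the example runs; the theorem asks only that $k_n$ decrease to zero.

```python
import math


def logistic(z):
    """Numerically stable exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def adaptive_weight(beta_u, k_n, a_n):
    """p_nR = 0.5 * xi_1n + 0.5 * xi_2n with gamma_1n = a_n * beta_u
    and gamma_2n = -a_n * beta_u."""
    gamma1 = a_n * beta_u
    gamma2 = -a_n * beta_u
    xi1 = logistic(-gamma1 * (beta_u - k_n))
    xi2 = logistic(gamma2 * (beta_u + k_n))
    return 0.5 * xi1 + 0.5 * xi2


def adaptive_estimate(alpha_r, alpha_u, beta_u, k_n, a_n):
    """Model-averaged estimate p_nR * alpha_R + (1 - p_nR) * alpha_U."""
    p_r = adaptive_weight(beta_u, k_n, a_n)
    return p_r * alpha_r + (1.0 - p_r) * alpha_u


n = 50
a_n = math.log(n) ** 2     # satisfies log(n) / a_n -> 0
k_n = n ** -0.25           # hypothetical: any sequence decreasing to zero
w_zero = adaptive_weight(0.0, k_n, a_n)   # beta_hat(U) exactly zero
w_far = adaptive_weight(1.0, k_n, a_n)    # beta_hat(U) far from zero
```

At $\hat\beta(U) = 0$ both exponents vanish and the weight equals one half, while for $\hat\beta(U)$ well away from zero the weight on the restricted model is negligible.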

Proof. For the asymptotic normality we only need to check that the conditions of 
Proposition 4.2 are met. We illustrate the calculation for verifying $n^{1/2} \xi_{1n} \xrightarrow{p} 0$, 
when $\beta \neq 0$. 

$$\begin{aligned}
P\left[|n^{1/2} \xi_{1n}| > \epsilon\right]
&= P\left[\left|\left(1 + \exp\{-\gamma_{1n}(\hat\beta(U) - k_n)\}\right)^{-1} \exp\{-\gamma_{1n}(\hat\beta(U) - k_n) + 0.5 \log(n)\}\right| > \epsilon\right] \\
&\le P\left[\exp\{-\gamma_{1n}(\hat\beta(U) - k_n) + 0.5 \log(n)\} > \epsilon\right] \\
&= P\left[\hat\beta(U) \text{ lies between the roots of } x^2 - k_n x - 0.5 a_n^{-1} \log(n) + a_n^{-1} \log(\epsilon) = 0\right].
\end{aligned}$$

The roots of the equation $x^2 - k_n x - 0.5 a_n^{-1} \log(n) + a_n^{-1} \log(\epsilon) = 0$ are always real 
when $\epsilon < 1$, since $k_n^2 + 2 a_n^{-1} \log(n) - 4 a_n^{-1} \log(\epsilon) > 0$ for all $n$. Note that the square 
of half the distance between the roots is given by $(k_n^2 + 2 a_n^{-1} \log(n) - 4 a_n^{-1} \log(\epsilon))/4$. 
When $k_n \downarrow 0$, $k_n^2 + 2 a_n^{-1} \log(n) - 4 a_n^{-1} \log(\epsilon) \downarrow 0$, hence the Lebesgue measure of 
the interval between the roots goes to zero as $n \to \infty$, thus ensuring 

$$P\left[\hat\beta(U) \text{ lies between the roots of } x^2 - k_n x - 0.5 a_n^{-1} \log(n) + a_n^{-1} \log(\epsilon) = 0\right] \to 0$$

as $n \to \infty$. Note that this result actually does not depend on the value of $\beta$, as long 
as it is non-zero. 

Other parts of the proof for asymptotic Normality may be verified similarly. Since 
$\check\alpha$ is a smooth function of $\alpha$, $\beta$ and $V$, and has an asymptotic Normal distribution, 
the consistency of the paired bootstrap procedure follows from [12] and [13]. □ 
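
The passage from the exponential inequality to the quadratic in the proof can be written out as follows; this is a sketch of the intermediate algebra, using $\gamma_{1n} = a_n \hat\beta(U)$ and $a_n > 0$.

$$\exp\{-a_n \hat\beta(U)(\hat\beta(U) - k_n) + 0.5 \log(n)\} > \epsilon \iff a_n\left(\hat\beta^2(U) - k_n \hat\beta(U)\right) < 0.5 \log(n) - \log(\epsilon)$$

$$\iff \hat\beta^2(U) - k_n \hat\beta(U) - 0.5 a_n^{-1} \log(n) + a_n^{-1} \log(\epsilon) < 0.$$

The last inequality holds exactly when $\hat\beta(U)$ lies strictly between the two roots

$$x_{\pm} = \frac{k_n \pm \sqrt{k_n^2 + 2 a_n^{-1} \log(n) - 4 a_n^{-1} \log(\epsilon)}}{2},$$

both of which tend to zero under the conditions of the theorem, whereas $\hat\beta(U)$ converges in probability to $\beta \neq 0$.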

Remark 4.1. The condition $k_n \downarrow 0$ as $n \to \infty$ is a weaker restriction than typically 
found in the literature. Since $\hat\beta(U) = O_p(n^{-1/2})$, the AIC criterion uses $k_n = O(n^{-1/2})$, 
while the BIC uses $k_n = O(n^{-1/2} \sqrt{\log(n)})$. 

Remark 4.2. The assumptions of Proposition 4.1 and Proposition 4.2 cannot be 
weakened in general. The example of Section 10.6 of [5] provides a test case. It is 
a simpler version of the model described in Section 2, and simply has $Y_1, \ldots, Y_n$ 
independent, identically distributed as $N(\mu, 1)$ random variables. Model uncertainty 
is about whether $\mu = 0$, and the natural estimator for $\mu$ is $\bar{Y}_n = n^{-1} \sum_i Y_i$ in 
the unrestricted model, and $0$ in the restricted model. A model-averaged estimator 
is $\hat\mu = W(n^{1/2} \bar{Y}_n) \bar{Y}_n$, for some weight $W(\cdot) \in [0, 1]$. Note that under a model with 
contiguous alternatives $\mu_{true} = n^{-1/2} \delta$, the requirement that $\hat\mu$ be consistent for 
$\mu_{true}$ actually places no restriction on the weight $W(\cdot)$, which may take any value 
in $[0, 1]$. However, if we want consistency under arbitrary $\mu$, $W(n^{1/2} \bar{Y}_n) \xrightarrow{p} 1$ is a 
requirement. 

For asymptotic normality, $n^{1/2} \mu (1 - W(n^{1/2} \bar{Y}_n)) \xrightarrow{p} 0$ and convergence in probability 
of $W(n^{1/2} \bar{Y}_n)$ are requirements. Under $\mu_{true}$, this implies that $W(n^{1/2} \bar{Y}_n) \xrightarrow{p} 1$ 
must hold, while for general $\mu$, the stronger condition $n^{1/2} (1 - W(n^{1/2} \bar{Y}_n)) \xrightarrow{p} 0$ 
must be satisfied. 

Under $\mu_{true}$, it is of interest to approximate the distribution of the standardized 
statistic 

$$\Lambda_n = n^{1/2}(\hat\mu - \mu_{true}) = n^{1/2} W(n^{1/2} \bar{Y}_n) \bar{Y}_n - \delta = W(\delta + Z_n)(\delta + Z_n) - \delta,$$

where $Z_n \sim N(0, 1)$. 
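
The last equality in the display for the standardized statistic rests on a one-line computation, sketched here under the stated assumption that $Y_1, \ldots, Y_n$ are independent $N(\mu_{true}, 1)$ with $\mu_{true} = n^{-1/2} \delta$:

$$\bar{Y}_n \sim N(n^{-1/2} \delta, n^{-1}) \quad \Longrightarrow \quad n^{1/2} \bar{Y}_n = \delta + Z_n, \qquad Z_n = n^{1/2}(\bar{Y}_n - \mu_{true}) \sim N(0, 1).$$

Substituting $n^{1/2} \bar{Y}_n = \delta + Z_n$ into $n^{1/2} W(n^{1/2} \bar{Y}_n) \bar{Y}_n - \delta$ gives $W(\delta + Z_n)(\delta + Z_n) - \delta$, whose distribution does not depend on $n$.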

A natural question is what should be a bootstrap equivalent of $\Lambda_n$. Suppose 
$Y_1^*, \ldots, Y_n^*$ are a random sample from the data $Y_1, \ldots, Y_n$. We consider the bootstrap 
equivalent of $n^{1/2} \bar{Y}_n$ to be $n^{1/2}(\bar{Y}_n^* - \bar{Y}_n)$, and not $n^{1/2} \bar{Y}_n^*$. This is in keeping 
with [4], who put forth the guideline that for good power performance, resampling 
must be done to reflect the null hypothesis. While model selection is not in general 
a hypothesis test, some of the same principles are applicable. 

Hence, we have $\hat\mu^* = W(n^{1/2}(\bar{Y}_n^* - \bar{Y}_n)) \bar{Y}_n^*$. When $1 - W(n^{1/2} \bar{Y}_n) \xrightarrow{p} 0$, it can be 
readily seen that the distribution of $\Lambda_n^* = n^{1/2}(\hat\mu^* - \hat\mu)$, conditional on $Y_1, \ldots, Y_n$, 
and that of $\Lambda_n$ converge to the same limit law. □ 

Remark 4.3. We conjecture that for the model-averaged estimator proposed in 
this section, a result similar to [16] would hold. In the framework of this paper, 
the statement corresponding to the main result of [16] would be as follows: Let 
$F_{n,\alpha,\beta}(t) = P[n^{1/2}(\check\alpha - \alpha) \le t]$, and let $\hat{F}_n(t)$ be an estimator of $F_{n,\alpha,\beta}(t)$ satisfying, 
for every $\delta > 0$, $P_{n,\alpha,\beta}[|\hat{F}_n(t) - F_{n,\alpha,\beta}(t)| > \delta] \to 0$ as $n \to \infty$. Then there exist $\delta_0 > 0$ 
and $\rho_0 > 0$ such that 

$$\sup_{(\alpha', \beta') \in B((\alpha, \beta); \rho_0/\sqrt{n})} P_{n,\alpha',\beta'}\left[|\hat{F}_n(t) - F_{n,\alpha',\beta'}(t)| > \delta_0\right] \to 1, \tag{4.9}$$

where $B((\alpha, \beta); a) = \{(\alpha', \beta') : \|(\alpha', \beta') - (\alpha, \beta)\| < a\}$ 
is the open ball of radius $a$ around $(\alpha, \beta)$. It can be seen that under standard 
conditions, if the supremum in (4.9) is taken over $B((\alpha, \beta); a_n)$ with $a_n = o(n^{-1/2})$ 
instead of $B((\alpha, \beta); \rho_0/\sqrt{n})$, the limit would be zero instead of 1. Thus the result 
of [16] may be improved to the case where the supremum is taken only over the set 
of parameter values that are exact order $n^{-1/2}$ away from the $(\alpha, \beta)$ under which 
the estimator $\hat{F}_n(\cdot)$ is computed. This is easily verified, for example, when $\alpha = 0$, 
$\sigma = 1$ and $x_{t2} = 1$. 

Note that from a bootstrap approximation point of view, (4.9) is not a negative 
result, but a very positive one. The uses of bootstrap approximation are for constructing 
interval estimates, testing hypotheses and so on. Equation (4.9) and other 
related results from [16] imply that a bootstrap approximation $\hat{F}_n(\cdot)$ constructed 
under the "null" $(\alpha, \beta)$ has sup-norm distance of 1 from the true distributions 
under parameter values that are exact order $n^{-1/2}$ away from the $(\alpha, \beta)$. Thus 
$\hat{F}_n(\cdot)$ has power 1 in hypothesis testing under contiguous alternatives. This is a 
further confirmation of the tenet of [4], that the resampling procedure ought to reflect 
the null hypothesis. 

Remark 4.4. It is of interest to know that the asymptotic variance of $\check\alpha$ depends 
on $\beta$, and is given by $Var(n^{1/2}(\check\alpha - \alpha)) - Var(n^{1/2}(\hat\alpha(U) - \alpha)) \to 0$ if $\beta \neq 0$, while 
$Var(n^{1/2}(\check\alpha - \alpha)) - \{0.5 \, Var(n^{1/2}(\hat\alpha(U) - \alpha)) + 0.5 \, Var(n^{1/2}(\hat\alpha(R) - \alpha))\} \to 0$ if $\beta = 0$. 
This is established by checking that both $\xi_{1n}$ and $\xi_{2n}$ tend to 1/2 as $n \to \infty$ when 
$\beta = 0$. Thus $\check\alpha$ performs like the correct estimator $\hat\alpha(U)$ when model $U$ is valid, and 
balances between the correct and conservative choices when the restricted model $R$ 
is true. 
5. A simulation example 

We performed a small simulation experiment to illustrate some of the features of 
inference under model uncertainty that have been discussed in the previous sections. 
We took $n = 50$, $x_{t1} = 1$, and generated 50 numbers from the Uniform distribution 
supported between zero and three and fixed these as the $x_{t2}$ values. We fixed $\alpha = 1$, 
and varied the $\beta$ values. 

For different values of $\beta \in [-1, 1]$, we obtained sampling distribution approximations 
of (i) the post-model-selected estimator $\hat\alpha_{MS}$, (ii) a version of the Bayesian 
model-averaged estimator $\hat\alpha_{BMA}$, and (iii) an adaptive model-averaged estimator 
$\hat\alpha_{AMA}$, by 5000 replications for each value of $\beta$. For the Bayesian model-averaged 
estimator, model $R$ was assigned weight 
$q_{nR} = \exp(-BIC_R/2) / (\exp(-BIC_R/2) + \exp(-BIC_U/2))$, 
while model $U$ was assigned weight $1 - q_{nR}$. We define 

$$BIC_R = \sum_i \left[Y_i - \hat\alpha_R x_{i1}\right]^2 + \log(n),$$

$$BIC_U = \sum_i \left[Y_i - \hat\alpha_U x_{i1} - \hat\beta_U x_{i2}\right]^2 + 2\log(n).$$
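
The BIC-based weight $q_{nR}$ can be computed as in the following sketch, which takes $\hat\alpha_R$, $\hat\alpha_U$ and $\hat\beta_U$ to be the least squares fits under the two models and evaluates the ratio of exponentials in a numerically stable way. The function names and the toy data are illustrative assumptions.

```python
import math


def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def logistic(z):
    """Numerically stable exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bic_weight_r(x1, x2, y):
    """q_nR: BIC-based weight of the restricted model."""
    n = len(y)
    n11, n22, n12 = inner(x1, x1), inner(x2, x2), inner(x1, x2)
    d = n11 * n22 - n12 ** 2
    s1, s2 = inner(x1, y), inner(x2, y)
    alpha_r = s1 / n11
    alpha_u = (n22 * s1 - n12 * s2) / d
    beta_u = (n11 * s2 - n12 * s1) / d
    rss_r = sum((v - alpha_r * a) ** 2 for a, v in zip(x1, y))
    rss_u = sum((v - alpha_u * a - beta_u * b) ** 2
                for a, b, v in zip(x1, x2, y))
    bic_r = rss_r + math.log(n)
    bic_u = rss_u + 2.0 * math.log(n)
    # exp(-bic_r/2) / (exp(-bic_r/2) + exp(-bic_u/2)) without overflow.
    return logistic(0.5 * (bic_u - bic_r))


# Toy, noise-free data (an assumption of this sketch).
x1 = [1.0] * 6
x2 = [0.0, 1.0, 2.0, 0.5, 1.5, 2.5]
q_flat = bic_weight_r(x1, x2, [1.0] * 6)
q_slope = bic_weight_r(x1, x2, [1.0 + 2.0 * b for b in x2])
```

When both models fit perfectly, as with the flat toy data, only the penalty terms differ and the weight reduces to $\sqrt{n}/(1 + \sqrt{n})$.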



For the adaptive model-averaged estimator, we took $a_n = (\log(n))^2$. 

The requirement that $a_n^{-1} \log(n) \to 0$ suggests that $a_n$ should be an increasing
sequence, growing faster than $\log(n)$. Several choices of $a_n$ were tried initially, and
it turned out that very slowly increasing sequences like $a_n = (\log(n))^2$ or very
quickly increasing sequences like a n ~ n 0A " performed better than others. This 
is a reflection of our way of constructing the functions $\xi_{1n}$ and $\xi_{2n}$ using $\gamma_{1n}$ and
$\gamma_{2n}$. Alternative choices, like $\gamma_{1n} = a_n |\hat\beta(U)| \{\hat\beta(U)\}^{-1}$, are a subject for further
research.

The first object of our study is the mean squared error of the three estimators of
$\alpha$, namely, $\hat\alpha_{MS}$, $\hat\alpha_{BMA}$, and $\hat\alpha_{AMA}$. Panel (a) in Figure 1 contains the graphs of the
mean squared error (MSE) as $\beta$ varies over $[-1, 1]$. In this and all subsequent
figures, the solid line corresponds to $\hat\alpha_{BMA}$, the broken line to $\hat\alpha_{MS}$, and the dotted
line to $\hat\alpha_{AMA}$. In this figure, we have also added the graph for the MSE of $\hat\alpha(U)$,
which is the nearly horizontal dot-and-dash line. First, model selection or averaging
is clearly better than $\hat\alpha(U)$ only in the region $0 \pm 2/\sqrt{n} \approx (-0.3, 0.3)$,
where MS, BMA and AMA all perform better than $\hat\alpha(U)$. However, in the neighboring
regions $|\beta| \in (0.3, 0.8)$, $\hat\alpha(U)$ has smaller MSE than the three estimators. For
high values of $|\beta|$, using model selection/averaging or the unrestricted model makes
little difference. Thus whether model averaging/selection is useful or not depends
considerably on the value of $\beta$. Also note that BMA has a lower MSE than
MS for low values of $|\beta|$ and only marginally higher MSE otherwise, with a much
lower maximum MSE value. The graph for AMA tends to stay closest to the graph
for $\hat\alpha(U)$, and thus does better than BMA or MS in the region $|\beta| \in (0.05, 0.75)$,
but is marginally poorer otherwise.
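
A comparison of this kind can be sketched in a few lines of Python. The code below estimates the MSE of $\hat\alpha(U)$, the model-selected estimator and the BIC-weighted average over a grid of $\beta$ values. It is only a sketch under stated assumptions: the errors are taken to be standard normal, model selection is assumed to pick the smaller BIC, the number of replications is reduced to 2000, and the adaptive estimator is omitted because its weights are not defined in this section. The function name `simulate_mse` is hypothetical.

```python
import numpy as np

def simulate_mse(beta, x2, alpha=1.0, reps=2000, rng=None):
    """Monte Carlo MSE of alpha-hat under U, MS and BMA for one value of beta."""
    rng = np.random.default_rng() if rng is None else rng
    n = x2.size
    # One simulated data set per row; N(0, 1) errors are an assumption.
    y = alpha + beta * x2 + rng.standard_normal((reps, n))
    # Model R (x_{i1} = 1 only): the estimate is the sample mean.
    a_r = y.mean(axis=1)
    rss_r = np.sum((y - a_r[:, None]) ** 2, axis=1)
    # Model U: closed-form simple linear regression.
    x2c = x2 - x2.mean()
    sxx = np.dot(x2c, x2c)
    b_u = (y @ x2c) / sxx
    a_u = a_r - b_u * x2.mean()
    rss_u = rss_r - b_u ** 2 * sxx
    # BIC weight of model R, as defined in this section.
    bic_r = rss_r + np.log(n)
    bic_u = rss_u + 2.0 * np.log(n)
    q = 1.0 / (1.0 + np.exp(-(bic_u - bic_r) / 2.0))
    a_ms = np.where(q >= 0.5, a_r, a_u)      # assumed rule: smaller BIC wins
    a_bma = q * a_r + (1.0 - q) * a_u
    return {
        "U": float(np.mean((a_u - alpha) ** 2)),
        "MS": float(np.mean((a_ms - alpha) ** 2)),
        "BMA": float(np.mean((a_bma - alpha) ** 2)),
    }

rng = np.random.default_rng(1)               # arbitrary seed
x2 = rng.uniform(0.0, 3.0, size=50)          # fixed design
betas = np.linspace(-1.0, 1.0, 21)
results = [simulate_mse(b, x2, rng=rng) for b in betas]
for b, r in zip(betas, results):
    print(f"beta={b:+.1f}  U={r['U']:.4f}  MS={r['MS']:.4f}  BMA={r['BMA']:.4f}")
```

Plotting the three columns against `betas` gives curves of the kind discussed above for the estimators covered by the sketch.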

In order to study how the three estimators balance between $\hat\alpha(R)$ and $\hat\alpha(U)$, we
computed the Kolmogorov-Smirnov distances $KS_{jR}$ and $KS_{jU}$ between the distribution
of $n^{1/2}(\hat\alpha_j - \alpha)$ and the distributions of $n^{1/2}(\hat\alpha_R - \alpha)$ and $n^{1/2}(\hat\alpha_U - \alpha)$,
where $j = MS, BMA, AMA$ (MS = model selected, BMA = Bayesian model-averaged,
AMA = adaptive model-averaged). We then computed the ratios

$$\mathrm{KSRatio}_j = 100\,\frac{KS_{jR}}{KS_{jR} + KS_{jU}}, \quad j = MS, BMA, AMA.$$
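
For concreteness, the ratio can be computed from simulated draws as in the following Python sketch. The helper names `ks_distance` and `ks_ratio` are hypothetical, and the normal samples at the end are stand-ins for draws of the centered and scaled estimators, chosen only to exercise the functions.

```python
import numpy as np

def ks_distance(x, y):
    """Sup-norm distance between the empirical CDFs of samples x and y."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])            # the supremum is attained at a sample point
    fx = np.searchsorted(x, grid, side="right") / x.size
    fy = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(fx - fy)))

def ks_ratio(draws_j, draws_r, draws_u):
    """100 * KS_jR / (KS_jR + KS_jU); undefined if both distances are zero."""
    ks_jr = ks_distance(draws_j, draws_r)
    ks_ju = ks_distance(draws_j, draws_u)
    return 100.0 * ks_jr / (ks_jr + ks_ju)

rng = np.random.default_rng(0)               # arbitrary seed
draws_r = rng.normal(0.0, 1.0, 5000)         # stand-in for draws under model R
draws_u = rng.normal(0.0, 2.0, 5000)         # stand-in for draws under model U
draws_j = rng.normal(0.0, 1.2, 5000)         # stand-in for estimator j
ratio = ks_ratio(draws_j, draws_r, draws_u)
print(ratio)
```

A ratio of 0 means that estimator $j$ is distributed like $\hat\alpha_R$, and a ratio of 100 means that it is distributed like $\hat\alpha_U$.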

Under ideal circumstances, this ratio ought to be zero at $\beta = 0$, and 100 for $\beta \neq 0$.
Panel (b) in Figure 1 displays the $\mathrm{KSRatio}_j$ values for the three estimators $j =
MS, BMA, AMA$. When $\beta = 0$, MS is closest to $\hat\alpha_R$, while, as predicted, AMA
balances between $\hat\alpha_R$ and $\hat\alpha_U$. The Bayesian model-averaged estimator BMA lies
between MS and AMA, and is quite close to MS. In the region $0 \pm 2/\sqrt{n} \approx
(-0.3, 0.3)$ both MS and BMA are much closer to $\hat\alpha_R$ than to $\hat\alpha_U$.

Fig 1. Panel (a) is the mean squared error of $\hat\alpha_{BMA}$ (solid line), $\hat\alpha_{MS}$ (broken line), $\hat\alpha_{AMA}$ (dotted
line), and $\hat\alpha(U)$ (dot-and-dash line). Panel (b) is the ratio of Kolmogorov-Smirnov distances
$\mathrm{KSRatio}_j = KS_{jR}/(KS_{jR} + KS_{jU})$, $j = MS, BMA, AMA$, scaled by 100, between distributions
of centered and scaled estimators and $\hat\alpha(R)$ (for $KS_{jR}$) and $\hat\alpha(U)$ (for $KS_{jU}$).

Next, we studied resampling for the three estimators. Both subsampling, with subsample
size $m = 20 = 0.4n$, and the bootstrap were studied. Note that subsampling
is consistent for all three estimators, but the bootstrap is consistent only for
AMA. Panels (a) and (b) of Figure 2 present the Kolmogorov-Smirnov
distance, scaled by 100, between the distribution of $n^{1/2}(\hat\alpha_j - \alpha)$ and its subsampling
and bootstrap versions, respectively, for $j = MS, BMA, AMA$. We present the graphs for
$|\beta| < 0.4 \approx 3/\sqrt{n}$, since there is not much difference between the three graphs for
other values of $\beta$. It can be seen that the distances between the actual distribution
and its subsampling/bootstrap versions are much smaller for AMA, while the
resampling approximations for MS and BMA are particularly bad in the regions
$\{|\beta| \in (0.1, 0.3)\}$. Also, there is little visual difference between the accuracies of the
subsampling and the bootstrap approximations despite their different asymptotic
behavior, which confirms some of the observations made in [1], [2] and [18].
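
The two resampling schemes compared here can be sketched as follows in Python. This is a minimal illustration, not the code used for the figures: it applies the paired bootstrap and subsampling without replacement to a model-selected estimator, under the assumptions that selection picks the smaller BIC and that errors are standard normal. The function names, the seed, the value $\beta = 0.2$, and the 1000 resamples are all choices made for the example.

```python
import numpy as np

def alpha_ms(x2, y):
    """Post-model-selection estimate of alpha (assumed rule: smaller BIC wins)."""
    n = y.size
    a_r = y.mean()                           # model R with x_{i1} = 1
    x2c = x2 - x2.mean()
    sxx = np.dot(x2c, x2c)
    b_u = np.dot(x2c, y) / sxx
    a_u = a_r - b_u * x2.mean()              # model U intercept
    rss_r = np.sum((y - a_r) ** 2)
    rss_u = rss_r - b_u ** 2 * sxx
    return a_r if rss_r + np.log(n) <= rss_u + 2.0 * np.log(n) else a_u

def paired_bootstrap(x2, y, estimator, n_boot=1000, rng=None):
    """Draws of n^{1/2}(alpha* - alpha-hat), resampling (x_i, Y_i) pairs with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = y.size
    center = estimator(x2, y)
    idx = rng.integers(0, n, size=(n_boot, n))
    stats = np.array([estimator(x2[i], y[i]) for i in idx])
    return np.sqrt(n) * (stats - center)

def subsample_draws(x2, y, estimator, m, n_sub=1000, rng=None):
    """Draws of m^{1/2}(alpha*_m - alpha-hat), subsamples of size m without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = y.size
    center = estimator(x2, y)
    stats = np.empty(n_sub)
    for k in range(n_sub):
        i = rng.choice(n, size=m, replace=False)
        stats[k] = estimator(x2[i], y[i])
    return np.sqrt(m) * (stats - center)

rng = np.random.default_rng(0)               # arbitrary seed
n, m = 50, 20
x2 = rng.uniform(0.0, 3.0, size=n)
y = 1.0 + 0.2 * x2 + rng.standard_normal(n)  # alpha = 1, beta = 0.2, N(0, 1) errors
boot = paired_bootstrap(x2, y, alpha_ms, rng=rng)
sub = subsample_draws(x2, y, alpha_ms, m=m, rng=rng)
print(np.quantile(boot, [0.05, 0.5, 0.95]))
print(np.quantile(sub, [0.05, 0.5, 0.95]))
```

Either set of draws can then be compared with simulated draws of $n^{1/2}(\hat\alpha_j - \alpha)$ through a Kolmogorov-Smirnov distance.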

Fig 2. Panel (a) is the subsampling approximation (subsample size 20) for the distribution of
centered and scaled $\hat\alpha_{BMA}$ (solid line), $\hat\alpha_{MS}$ (broken line), $\hat\alpha_{AMA}$ (dotted line). Panel (b) is the
corresponding bootstrap approximation.



6. Discussion and conclusions 



The problems associated with post-model-selection estimation have been discussed 
by several researchers. In current statistical practice, the process of selecting a 
model has similarities with hypothesis testing. On the other hand, estimation of 
parameters, some of which may be known constants in some of the models, is 
generally entirely separated from model selection. Estimation and testing/selection 
are two different paradigms of statistical analysis that are hard to integrate. The 
lack of uniformity across models that parameter estimators generally display, and 
the issues that arise subsequently, are products of the less than successful attempt 
to combine the two processes of estimation and selection. 

In the Bayesian paradigm, model averaging seems to be a good integration of 
the two, since the selection step here is also an estimation exercise in spirit. The 
statement about integrated risks in Proposition 3.1 implies that Bayes' risks of 
model-averaged estimators are bounded. Thus, while minimaxity seems to be an 
elusive goal under model uncertainty, a fully Bayesian approach to analyzing risk 
behavior may be more successful. 

In the context of bootstrapping model-averaged estimators, an alternative to $\hat\alpha$
is to estimate the bias in $\hat\alpha$ in all the models, and define a bias corrected average
of these. As the bias of $\hat\alpha(R)$ is $\beta \|X_1\|^{-1} \langle X_1, X_2 \rangle$, if we estimate this by
$\hat\beta \|X_1\|^{-1} \langle X_1, X_2 \rangle$, we get back $\hat\alpha(U)$. Nevertheless, in more complex problems
the "bias corrected model averaged" estimator may be an interesting object to
study.

In Theorem 4.1 we established the consistency of the paired bootstrap for a data-adaptive
model-averaged estimator. Two other kinds of bootstrap are available in
the linear regression context; namely, parametric bootstrap and the residual-based 
bootstrap. When only one model is in use, the parametric bootstrap generates 
data from it using estimated values for the unknown parameters, while the residual 
bootstrap obtains residuals after fitting the model. The equivalents of these are not 
obvious under model uncertainty. 
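
For the single-model case, the two schemes can be sketched as follows in Python, here with model U as the only model in use. This is an illustration under stated assumptions: the parametric version assumes normal errors with variance estimated from the fit, and the function names, the seed and the simulated data are hypothetical. Nothing here addresses the model-uncertainty case.

```python
import numpy as np

def fit_u(x2, y):
    """Least squares fit of model U; returns the design matrix and coefficients."""
    X = np.column_stack([np.ones_like(x2), x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, coef

def parametric_bootstrap(x2, y, n_boot=1000, rng=None):
    """Generate Y* from the fitted model with N(0, sigma-hat^2) errors and refit."""
    rng = np.random.default_rng() if rng is None else rng
    X, coef = fit_u(x2, y)
    resid = y - X @ coef
    sigma = np.sqrt(resid @ resid / (y.size - X.shape[1]))
    y_star = X @ coef + sigma * rng.standard_normal((n_boot, y.size))
    coef_star = np.linalg.lstsq(X, y_star.T, rcond=None)[0]   # shape (2, n_boot)
    return coef_star[0] - coef[0]            # draws of alpha* - alpha-hat

def residual_bootstrap(x2, y, n_boot=1000, rng=None):
    """Resample fitted residuals with replacement, add to fitted values, refit."""
    rng = np.random.default_rng() if rng is None else rng
    X, coef = fit_u(x2, y)
    resid = y - X @ coef
    resid = resid - resid.mean()             # centering (a no-op up to rounding with an intercept)
    idx = rng.integers(0, y.size, size=(n_boot, y.size))
    y_star = X @ coef + resid[idx]
    coef_star = np.linalg.lstsq(X, y_star.T, rcond=None)[0]
    return coef_star[0] - coef[0]

rng = np.random.default_rng(0)               # arbitrary seed
x2 = rng.uniform(0.0, 3.0, size=50)
y = 1.0 + 0.2 * x2 + rng.standard_normal(50)
para = parametric_bootstrap(x2, y, rng=rng)
resi = residual_bootstrap(x2, y, rng=rng)
print(para.std(), resi.std())
```

Both functions hold the design fixed and condition on a single fitted model, which is the step that has no obvious counterpart when the model itself is uncertain.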

In Section 4 we remarked that the data adaptive weights $p_{nR}$ and $p_{nU}$ may
not share the same properties as the posterior model probabilities $\pi_{nR}$ and $\pi_{nU}$ of
Section 3. It would be interesting to study when $p_{nR}$ and $p_{nU}$ can be interpreted as
posterior probabilities, and also under what conditions the frequentist properties
of a Bayesian model-averaged estimator may be elicited using the bootstrap.